HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling the periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method reaches quality similar to human speech while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU.
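To put the speed claim in concrete terms, the following is a small arithmetic sketch, not anything taken from the paper's code. It assumes that "167.9 times faster than real-time" means 167.9 seconds of audio are produced per second of compute; the function names are hypothetical.

```python
# Sketch: convert a "times faster than real-time" figure into throughput.
# Assumption: a speed-up of k means k seconds of audio per second of compute.

def samples_per_second(sample_rate_hz, realtime_factor):
    """Audio samples generated per wall-clock second."""
    return sample_rate_hz * realtime_factor


def compute_ms_per_audio_second(realtime_factor):
    """Milliseconds of compute needed for one second of audio."""
    return 1000.0 / realtime_factor


SAMPLE_RATE_HZ = 22050   # 22.05 kHz, as stated in the abstract
REALTIME_FACTOR = 167.9  # as stated in the abstract

throughput = samples_per_second(SAMPLE_RATE_HZ, REALTIME_FACTOR)
latency_ms = compute_ms_per_audio_second(REALTIME_FACTOR)
print(f"{throughput:,.0f} samples/s, {latency_ms:.2f} ms per second of audio")
```

Under that reading, the stated figures correspond to roughly 3.7 million samples per second, or about 6 ms of GPU time per second of speech.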
Review for NeurIPS paper: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
Strengths: (1) The paper proposes a new model named HiFi-GAN for efficient and high-fidelity raw waveform generation from mel-spectrograms. In addition to the existing Multi-Scale Discriminator (MSD), the discriminator includes a set of small sub-discriminators, collectively called the Multi-Period Discriminator (MPD). Each MPD sub-discriminator handles a portion of the periodic signals in the input audio, so that together they capture the diverse periodic patterns underlying the audio data.
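The idea of a sub-discriminator looking only at equally spaced samples can be illustrated by folding a 1D waveform into a 2D grid of width p, so that each column holds the samples spaced p apart. The sketch below is only an illustration of that folding step: the helper names are hypothetical, the zero-padding is a simplifying assumption, and no discriminator network is included.

```python
# Sketch: fold a 1D waveform into rows of length `period` so that each
# column contains samples spaced `period` apart.
# Assumptions: zero-padding at the end (a simplification), hypothetical names.

def fold_by_period(samples, period):
    """Return a list of rows, each of length `period`."""
    if period < 1:
        raise ValueError("period must be a positive integer")
    padded = list(samples)
    remainder = len(padded) % period
    if remainder:
        # Pad so the length is a multiple of the period.
        padded.extend([0.0] * (period - remainder))
    return [padded[i:i + period] for i in range(0, len(padded), period)]


def column(rows, index):
    """Samples at positions index, index + period, index + 2 * period, ..."""
    return [row[index] for row in rows]


waveform = list(range(10))  # stand-in for audio samples
for period in (2, 3, 5):    # illustrative periods only
    grid = fold_by_period(waveform, period)
    print(period, len(grid), column(grid, 0))
```

A 2D convolution whose kernel has width 1 along the period axis would then process each column independently, which is one way a sub-discriminator could be restricted to a single periodic slice of the signal.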
Review for NeurIPS paper: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
This work initially received mixed reviews, but after the author feedback cleared up a misunderstanding, most reviewers now recommend acceptance. Nevertheless, I think R2 (who has not raised their score) has some valid concerns, which I want to account for in my decision. I have decided to recommend acceptance. The experimental section of this work is fairly comprehensive and adequately demonstrates that the proposed architecture is effective. However, it is important to point out that the majority of the experiments were conducted using ground-truth mel-spectrogram conditioning, which does not match the usual practical setting of TTS systems, where the spectrograms are themselves generated by a model (and are thus imperfect).